SYS 6018 | Spring 2021 | University of Virginia
Tell the reader what this project is about. Motivation.
Load data, explore data, etc.
# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library(GGally)
library(gridExtra)
library(plotly)
haiti <- read_csv("HaitiPixels.csv")
print(dim(haiti))
#> [1] 63241 4
head(haiti)
#> # A tibble: 6 x 4
#> Class Red Green Blue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
The data frame contains 4 columns and 63,241 rows. The Class column contains the correct label for each observation. Red, Green, and Blue hold the intensity of the corresponding color channel for each pixel.
To prepare the data for exploratory data analysis I must make Class a factor.
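# Convert Class to a factor. The result is printed but not assigned,
# so haiti$Class itself remains a character column in later chunks.
haiti %>%
  mutate(Class = factor(Class))
#> # A tibble: 63,241 x 4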
#> Class Red Green Blue
#> <fct> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
#> 7 Vegetation 71 72 51
#> 8 Vegetation 69 70 49
#> 9 Vegetation 68 70 49
#> 10 Vegetation 67 70 50
#> # ... with 63,231 more rows
haiti %>%
  group_by(Class) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 5 x 3
#> Class N Perc
#> * <chr> <int> <dbl>
#> 1 Blue Tarp 2022 3
#> 2 Rooftop 9903 16
#> 3 Soil 20566 33
#> 4 Various Non-Tarp 4744 8
#> 5 Vegetation 26006 41
The records are not evenly distributed among the classes. Blue Tarp, the “positive” class in a binary positive/negative framing, makes up only 3% of the sample, while Soil and Vegetation together account for 74%.
It will be interesting to compare model performance when predicting all five classes versus the binary outcome of Blue Tarp or not Blue Tarp.
Create a DataFrame that labels each pixel as Blue Tarp or not Blue Tarp:
* 0 == Not a Blue Tarp
* 1 == Is a Blue Tarp
haitiBinary <- haiti %>%
  mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
haitiBinary %>%
  group_by(ClassBinary) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 61219 97
#> 2 1 2022 3
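The 97/3 split above has a practical consequence that a short calculation makes concrete. The sketch below uses only the two counts from the table and a hypothetical baseline that labels every pixel "0"; the variable names are illustrative and not part of the original analysis.

```r
# Class counts copied from the summary table above
n_neg <- 61219  # ClassBinary == 0 (not a blue tarp)
n_pos <- 2022   # ClassBinary == 1 (blue tarp)

# Hypothetical baseline: predict "0" for every pixel
baseline_accuracy    <- n_neg / (n_neg + n_pos)
baseline_sensitivity <- 0 / n_pos  # no blue tarp is ever detected

print(round(baseline_accuracy, 3))
print(baseline_sensitivity)
```

A model that never finds a blue tarp would still be about 97% accurate, so accuracy alone is likely a weak yardstick here; sensitivity or an ROC-based measure may be more informative.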
redplot <- ggplot(haiti, aes(x=Class, y=Red)) +
  geom_boxplot(col='red')
greenplot <- ggplot(haiti, aes(x=Class, y=Green)) +
  geom_boxplot(col='darkgreen')
blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) +
  geom_boxplot(col='darkblue')
grid.arrange(redplot, greenplot, blueplot)
redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) +
  geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) +
  geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) +
  geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)
Box Plot Comments
In the comments below, “Blue Tarp” is treated as the “positive” result and all other classes as the “negative” result.
In the box plots of the five classes, “Soil” and “Vegetation” are relatively distinct in their non-outlier RGB values, while “Rooftop” and “Various Non-Tarp” are more similar in their RGB values.
When the classes are collapsed to the binary values “Blue Tarp (1)” and “Not Blue Tarp (0)”, there is little overlap in the Blue values of the two classes, and the ranges of Red and Green are much smaller for blue tarps than for non-blue-tarps.
Red values have a larger range for negative results than for positive results, with similar medians in both groups. Green values also have a larger range for negative results, but the positive results have a higher median. Blue values show almost no overlap between blue tarps and non-blue tarps.
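The medians and ranges read off the box plots can also be tabulated directly. The following is a minimal sketch using base R `aggregate()` on a small made-up data frame (`toy`, a stand-in for haitiBinary); the values are illustrative only and are not taken from HaitiPixels.csv.

```r
# Made-up stand-in for haitiBinary; values are illustrative only
toy <- data.frame(
  ClassBinary = factor(c("0", "0", "0", "1", "1", "1")),
  Red   = c(64, 120, 200, 150, 160, 170),
  Green = c(67, 110, 190, 170, 180, 190),
  Blue  = c(50,  90, 150, 210, 220, 230)
)

# Median of each channel within each class
channel_medians <- aggregate(cbind(Red, Green, Blue) ~ ClassBinary,
                             data = toy, FUN = median)

# Range (max minus min) of each channel within each class
channel_ranges <- aggregate(cbind(Red, Green, Blue) ~ ClassBinary,
                            data = toy, FUN = function(v) diff(range(v)))

print(channel_medians)
print(channel_ranges)
```

Replacing `toy` with haitiBinary should give the per-class numbers behind the box plots, assuming the columns are named as in the chunks above.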
These correlations make sense because the pixels are highly saturated colors that are not pure Red, Green, or Blue. Few pixels in the data set have low values for R, G, and B.
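To put numbers on the channel correlations, a correlation matrix is one option. This sketch uses base R `cor()` on a made-up data frame (`toy`); on the real data the equivalent call would presumably be `cor(haiti[-1])`, which drops the Class column in the same way as the pairs plot.

```r
# Made-up RGB values standing in for haiti[-1]; illustrative only
toy <- data.frame(
  Red   = c(64, 75, 120, 200, 150),
  Green = c(67, 82, 110, 190, 170),
  Blue  = c(50, 53,  90, 150, 210)
)

# Pearson correlation between every pair of channels
channel_cor <- cor(toy)
print(round(channel_cor, 2))
```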
ggpairs(haiti[-1], lower = list(continuous = "points", combo = "dot_no_facet"), progress = FALSE)
### 3-D Scatterplot
To view the relationship among the Red, Green, and Blue values across the five classes and the binary classes, an interactive 3-D scatter plot is extremely useful.
References:
* https://plotly.com/python/3d-scatter-plots/
* https://plotly.com/r/figure-labels/
The scatter plot displays each pixel's Red, Blue, and Green values on the x, y, and z axes, colored by class.
fiveCat3D <- plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green, type="scatter3d", mode="markers", color=haiti$Class, colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
  marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
fiveCat3D <- fiveCat3D %>%
  layout(title="5 Category RGB Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))
fiveCat3D
The 3D scatter plot is particularly useful because, by zooming in, one can see a region with significant mingling of “blue tarp” pixels and pixels from other categories. That region of the data will be a challenge for our model.
binary3D <- plot_ly(x=haitiBinary$Red, y=haitiBinary$Blue, z=haitiBinary$Green, type="scatter3d", mode="markers", color=haitiBinary$ClassBinary, colors = c('red','blue2'),
  marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
binary3D <- binary3D %>%
  layout(title="Binary RGB Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))
binary3D
Similar to the five-category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp pixels. As expected, there is a region where blue tarp and non-blue-tarp pixels mingle, which will be a challenge for a model.
Normalization does not need to be considered because the ranges of Red, Green and Blue are the same.
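The claim about shared ranges is easy to check. Below is a minimal sketch, assuming a made-up data frame (`toy`) in place of the real pixels; on the actual data the call would presumably be `sapply(haiti[-1], range)`. The source does not state the bit depth of the imagery, so no particular bounds are assumed.

```r
# Made-up RGB values standing in for haiti[-1]; illustrative only
toy <- data.frame(
  Red   = c(64, 75, 120, 200, 150),
  Green = c(67, 82, 110, 190, 170),
  Blue  = c(50, 53,  90, 150, 210)
)

# range() returns c(min, max); sapply() stacks them into a 2 x 3 matrix
channel_ranges <- sapply(toy, range)
rownames(channel_ranges) <- c("min", "max")
print(channel_ranges)
```

If the three columns of the resulting matrix are close to one another, the channels are on a common scale and rescaling should change little for scale-sensitive methods.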
The DataFrame must be divided into training and test data sets; however, the data set is unbalanced with few “positive” results, i.e. “Blue Tarp”, compared with negative results.
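One common way to split unbalanced data is stratified sampling, which draws the same fraction from each class so that both sets keep roughly the same share of positives. The report does not say which splitting method was used, so the following is only a sketch: the 80/20 proportion, the seed, and the made-up `toy` labels are all assumptions.

```r
set.seed(6018)  # arbitrary seed, used only to make the sketch reproducible

# Made-up imbalanced labels standing in for haitiBinary$ClassBinary
toy <- data.frame(
  id = 1:1000,
  ClassBinary = factor(c(rep("0", 970), rep("1", 30)))
)

train_frac <- 0.8  # hypothetical split proportion, not taken from the report

# Sample row indices within each class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$ClassBinary)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(train_frac * length(rows)))]
}), use.names = FALSE)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Both sets should show about the same proportion of "1"
print(prop.table(table(train$ClassBinary)))
print(prop.table(table(test$ClassBinary)))
```

Sampling within each class keeps the rare Blue Tarp pixels from ending up almost entirely in one of the two sets, which a simple random split could do by chance.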
How were tuning parameter(s) selected? What value is used? Plots/Tables/etc.
NOTE: PART II same as above plus add Random Forest and SVM to Model Training.
**CV Performance Table Here**
Hold-Out Performance Table Here